Predicting Transfer Fee of a Football Player

"In 2019, the football transfer market reached unprecedented levels both in terms of the number of transfers and the amount spent on fees.."

James Kitching, Director of Football Regulatory

A growing football industry can lead clubs to spend more than they can possibly earn. UEFA introduced the Financial Fair Play Regulations to audit clubs' expenses and prevent them from collapsing financially. UEFA describes the aim of those regulations as improving the overall financial health of European club football. To comply with the regulations, valuing a player accurately becomes crucial, whether selling or buying.

FIFA published a report showing statistics about transfers in 2019. The report highlights repeatedly that football teams are willing to spend large sums to stay competitive with others.

Spending on transfer fees by year (USD billion)

The chart above shows how fast the football transfer industry is growing. Spending reached ~7.5 billion USD in 2019.

Number of transfers with fees and average transfer fee by year (USD billion)

The table above shows that both the number of transfers and the average fee increase as the years pass.

What makes a football player expensive? Can the fee be predicted? I decided to pursue these questions and built a model of transfer fees using data from the world-renowned website "transfermarkt". I started by scraping transfer data from transfermarkt.

Web Scraping and Feature Engineering

I decided to scrape all transfers with a fee, player profiles, all the matches a football player played before the transfer date, achievements, and player transfer histories using BeautifulSoup. All the detailed scraping code is available in my GitHub repository.
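As a rough illustration of the scraping step, the sketch below parses a small inline HTML table with BeautifulSoup. The HTML snippet, the class name `items`, and the two-column layout are made up for this example and do not reflect the actual transfermarkt markup; the real scraping code is the one in the repository.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; the real markup differs.
SAMPLE_HTML = """
<table class="items">
  <tr><th>Player</th><th>Fee</th></tr>
  <tr><td>Player A</td><td>10.00m</td></tr>
  <tr><td>Player B</td><td>Loan</td></tr>
</table>
"""


def parse_transfer_rows(html):
    """Return one dict per data row of the (assumed) transfer table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="items")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # the header row has <th> cells only, so it is skipped
            rows.append({"player": cells[0], "fee": cells[1]})
    return rows


rows = parse_transfer_rows(SAMPLE_HTML)
```

In a real run the HTML would come from an HTTP response rather than a string literal, and the loop would be repeated for every page and date.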

Below I explain the reasoning behind scraping each specific link and the features I produced from the scraped data.

Transfers

Transfers is the main table; it contains the transfer history of all leagues that transfermarkt covers. I decided to scrape all the transfers from the link. This link shows the transfers on each date, as shown below.

I looped over all the pages and dates from 2016 to 2020. Finally, I eliminated loans and free transfers to keep only paid transfers. The data pattern of the transfer data is:

Transfer Details and Teams Played

Transfer details are the source of a player's transfer history. For each transfer, specific information is also provided, such as age at the time of transfer, market value at the time of transfer, and remaining contract. The transfer detail page for a player can be reached from here.

I looped over all the players in the paid transfer dataset. I also created other features such as number of transfers in the transfer history, Avg Market Value, Total Fee, Avg Fee, etc., as well as flag features such as Country_Change_Flag, League_Change_Flag, League_Tier_Up_Flag, and League_Tier_Down_Flag.
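A minimal pandas sketch of how such flags and per-player aggregates could be derived is given here. Only the four flag names come from the text; the input column names, the tier encoding (1 = top division), and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy transfer history; input column names and values are assumed.
history = pd.DataFrame({
    "player_id": [1, 1, 2],
    "country_left": ["Spain", "Italy", "France"],
    "country_joined": ["Italy", "Italy", "England"],
    "league_left": ["LaLiga", "Serie A", "Ligue 1"],
    "league_joined": ["Serie A", "Serie B", "Premier League"],
    "tier_left": [1, 1, 1],      # 1 = top division (assumed encoding)
    "tier_joined": [1, 2, 1],
    "fee": [10.0, 4.0, 30.0],    # million EUR (assumed unit)
})

# Flags compare the "left" and "joined" side of each transfer.
history["Country_Change_Flag"] = (
    history["country_left"] != history["country_joined"]
).astype(int)
history["League_Change_Flag"] = (
    history["league_left"] != history["league_joined"]
).astype(int)
# A smaller tier number means a higher division under the assumed encoding.
history["League_Tier_Up_Flag"] = (history["tier_joined"] < history["tier_left"]).astype(int)
history["League_Tier_Down_Flag"] = (history["tier_joined"] > history["tier_left"]).astype(int)

# Per-player aggregates over the whole transfer history.
per_player = history.groupby("player_id")["fee"].agg(
    NOF_Transfers="count", Total_Fee="sum", Avg_Fee="mean"
)
```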

I also created another dataset that shows the team a player played for on each date.

The data patterns of the transfer detail and teams played datasets are:

Player Profile

The player profile provides general information about the footballer, such as citizenship, height, preferred foot, positions, etc. You can reach a player's profile page from here. The screenshot below shows the information available in a player's profile.

The data pattern for profile is:

Stats in Club Matches

This was the hard part of the project. On transfermarkt, stats are given per season; however, a footballer can change clubs during a season. Since I wanted the stats from before each transfer, I could not use the website's season stats, so I decided to create my own features by scraping all the matches a footballer played. The scraping code is available on GitHub; however, the dataset is not, since it is bigger than GitHub's file size limit. The matches are available in the link.

Later I created features such as number of games played, average minutes, goals, assists, times benched, games substituted, injuries, suspensions, match points, etc. over the last 5, 10, 20, and 30 games before the transfer date. To do this, I first worked in "WinLoseDraw.ipynb" to get the win, draw, or loss information. Then I created the features in "Stats_Features.ipynb" in the feature engineering folder.
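The "last N games before the transfer date" idea can be sketched roughly as follows. The function name `last_n_stats`, the column names, and the toy match log are hypothetical; the notebooks named above hold the actual implementation.

```python
import pandas as pd

# Toy match log for one player; column names are assumed.
matches = pd.DataFrame({
    "player_id": [1, 1, 1, 1],
    "match_date": pd.to_datetime(
        ["2019-01-05", "2019-01-12", "2019-01-19", "2019-02-02"]
    ),
    "minutes": [90, 45, 0, 90],
    "goals": [1, 0, 0, 2],
})


def last_n_stats(matches, player_id, transfer_date, n):
    """Aggregate a player's last n matches played strictly before transfer_date."""
    before = matches[
        (matches["player_id"] == player_id)
        & (matches["match_date"] < transfer_date)
    ]
    recent = before.sort_values("match_date").tail(n)
    return {
        f"games_last_{n}": len(recent),
        f"avg_minutes_last_{n}": float(recent["minutes"].mean()),
        f"goals_last_{n}": int(recent["goals"].sum()),
    }


stats = last_n_stats(matches, 1, pd.Timestamp("2019-01-31"), n=2)
```

Calling the function once per window size (5, 10, 20, 30) and per transfer would yield one feature row per transfer; the match on 2019-02-02 is ignored here because it falls after the transfer date.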

The data pattern of the stats features dataset is:

Stats in National Matches

A football player can also play for his national team, and national team stats can also help predict the transfer fee. The difficult part of this dataset is collecting each player's matches, since players can also appear at multiple youth levels such as Under 21, Under 19, etc. Here is the link of national matches. I first scraped the national teams in "national_urls_scraper.ipynb", then looped over the different national team levels and over all players to create the dataset.

The rest of the national match stats and features are built just like those for club matches.

The data pattern is like that of the club matches dataset:

Achievements

Finally, achievements are also very important for predicting the transfer fee. The fee can increase if the player wins top goal scorer awards or cups. I decided to scrape achievements from the link.

In the feature engineering folder, I created the number of achievements and the number of distinct achievements of a player as features for the prediction model.
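The two achievement counts could be computed with a single groupby, as in this small sketch. The column names, feature names, and achievement labels are placeholders, not the actual schema.

```python
import pandas as pd

# Toy achievements table; one row per achievement won (assumed layout).
achievements = pd.DataFrame({
    "player_id": [1, 1, 1, 2],
    "achievement": [
        "Top goal scorer",
        "Domestic cup winner",
        "Top goal scorer",
        "Domestic champion",
    ],
})

# "count" gives all achievements, "nunique" gives distinct achievement types.
achievement_features = achievements.groupby("player_id")["achievement"].agg(
    NOF_Achievements="count",
    NOF_Distinct_Achievements="nunique",
)
```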

The data pattern for the achievement dataset is:

Data Exploration

Let's explore the dataset a little. First I will take the head, tail, and describe of the dataframe. Then I will ask 4 questions of the data and answer each with a visualization.

1. How many players are in the data in each season? (2016-2020)

2. What is the average fee in each year?

3. What is the average value for each position?

4. What is the average fee for each position?

Data Preprocessing

Let's explore the dataset a little further. How many players exist in each position? Let's use the team_position field first to answer this question.

1. Number of days since last played

Some time-difference features created in the national stats and club stats sections have the timedelta64 data type. Below, I extract the number of days from them.
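With pandas, a timedelta64 column exposes its whole-day component through the `.dt.days` accessor. The column names in this sketch are assumed.

```python
import pandas as pd

# Toy frame; column names are assumed.
df = pd.DataFrame({
    "transfer_date": pd.to_datetime(["2019-07-01", "2019-08-15"]),
    "last_match_date": pd.to_datetime(["2019-06-20", "2019-05-15"]),
})

# Subtracting two datetime columns yields a timedelta64 column.
df["since_last_played"] = df["transfer_date"] - df["last_match_date"]

# .dt.days converts each timedelta to an integer number of days.
df["days_since_last_played"] = df["since_last_played"].dt.days
```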

2. Foot

In the dataset, foot is a categorical variable. I create dummy variables for foot, indicating whether the player is left-footed, right-footed, or uses both feet.
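A short sketch of the dummy encoding with `pandas.get_dummies` follows; the category labels and the `foot` prefix are assumptions about how the raw values look.

```python
import pandas as pd

# Toy profile data; the labels "left", "right", "both" are assumed.
profile = pd.DataFrame({"foot": ["left", "right", "both", "right"]})

# One 0/1 column per category; columns come out in sorted label order.
foot_dummies = pd.get_dummies(profile["foot"], prefix="foot", dtype=int)
```

The same call applies to any other categorical field, such as the main position.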

3. Main Positions

The main position of a football player is categorical, so I create dummy variables for each main position.

4. NOF Positions

In the profile dataset, the positions a footballer can play are specified as main position, other position1, and other position2. I created a new variable, number of positions, from those fields. Every football player has a main position, but some can play in multiple positions.
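One way to count positions is to count the non-missing values across the three position fields, assuming that an unused slot is stored as a missing value. The column names below are placeholders.

```python
import pandas as pd

# Toy profile data; unused position slots are assumed to be missing (None).
profile = pd.DataFrame({
    "main_position": ["Centre-Forward", "Left-Back"],
    "other_position1": ["Left Winger", None],
    "other_position2": ["Right Winger", None],
})

position_cols = ["main_position", "other_position1", "other_position2"]

# Count the filled-in position fields in each row.
profile["NOF_Positions"] = profile[position_cols].notna().sum(axis=1)
```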

5. Country Left and Country Joined

In each transfer, the country left and country joined fields are available. I define a threshold of 50 and create a dummy variable for each country whose value count is greater than the threshold. Countries at or below the threshold are grouped into an "uncommon" category.
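The thresholding step might be sketched like this. The helper name `dummies_with_uncommon` is hypothetical, and the demo uses a threshold of 1 instead of 50 so that the toy data stays small.

```python
import pandas as pd


def dummies_with_uncommon(series, threshold, prefix):
    """Dummy-encode a column, pooling values with count <= threshold as 'uncommon'."""
    counts = series.value_counts()
    common = counts[counts > threshold].index
    # Keep common values as they are and replace the rest with one shared label.
    grouped = series.where(series.isin(common), "uncommon")
    return pd.get_dummies(grouped, prefix=prefix, dtype=int)


# Toy column: England x3, Spain x2, Malta x1.
country_left = pd.Series(["England"] * 3 + ["Spain"] * 2 + ["Malta"])
left_dummies = dummies_with_uncommon(country_left, threshold=1, prefix="left")
```

Pooling rare countries keeps the number of dummy columns manageable and avoids columns that are almost entirely zero.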

6. Contract Left at the time of transfer

The transfer fee should be highly correlated with the time remaining on the footballer's contract with his current team. This value is given for some football players, but its format needs to be changed. I calculated the remaining months of the existing contract.
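The raw format of the contract field is not shown here, so the following is only a sketch under the assumption that a contract expiry date and a transfer date are both available as dates. The column names are made up.

```python
import pandas as pd

# Toy data; a missing expiry date models players without contract information.
df = pd.DataFrame({
    "transfer_date": pd.to_datetime(["2019-07-01", "2019-01-15"]),
    "contract_expires": pd.to_datetime(["2021-06-30", None]),
})

# Whole-month difference between expiry and transfer; missing values stay missing.
df["contract_left_months"] = (
    (df["contract_expires"].dt.year - df["transfer_date"].dt.year) * 12
    + (df["contract_expires"].dt.month - df["transfer_date"].dt.month)
)
```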

7. Age

The age field should be formatted for use in the model.

8. Height

The height field should be converted to float.

9. Market Value at the time of transfer

The market value at the time of the transfer should be formatted.

10. Fee

Fee is the target variable and should be float.
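The formatting steps for height, market value, and fee could look roughly like the sketch below. The raw string formats ("1,85 m", "€10.00m", "€500Th.") are assumptions about how the scraped values appear, and both helper names are hypothetical.

```python
import re


def parse_height(raw):
    """Convert an assumed raw height such as '1,85 m' to metres as a float."""
    return float(raw.replace("m", "").replace(",", ".").strip())


def parse_money(raw):
    """Convert an assumed raw amount such as '€10.00m' or '€500Th.' to a float.

    Returns None when no amount with a known unit suffix is found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(bn|m|th|k)", raw.lower())
    if match is None:
        return None
    multipliers = {"bn": 1e9, "m": 1e6, "th": 1e3, "k": 1e3}
    return float(match.group(1)) * multipliers[match.group(2)]


height = parse_height("1,85 m")
market_value = parse_money("€10.00m")
fee = parse_money("€500Th.")
```

The same `parse_money` helper would serve both the market value and the fee column, since both are monetary amounts.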

Merge Dummy Variables and Drop Unnecessary Fields

Some columns are dropped so that they are not used in prediction.

Modelling: Predicting Transfer Fee

Let's try to predict the transfer fee of players using these features. I will use LightGBM and Keras for prediction, and afterwards compare the results of the two.

Prediction of the Transfer Fee : LightGBM Regressor

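LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed and higher efficiency, lower memory usage, better accuracy, support of parallel, distributed, and GPU learning, and the capability to handle large-scale data.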

Leaf-wise (Best-first) Tree Growth

Most decision tree learning algorithms grow trees by level (depth)-wise, like the following image:

LightGBM grows trees leaf-wise. It will choose the leaf with max delta loss to grow. Holding #leaf fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.

Leaf-wise may cause over-fitting when data is small, so LightGBM includes the max_depth parameter to limit tree depth. However, trees still grow leaf-wise even when max_depth is specified.

Source : LightGBM

Keras

Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.

Keras is the high-level API of TensorFlow 2.0: an approachable, highly-productive interface for solving machine learning problems, with a focus on modern deep learning. It provides essential abstractions and building blocks for developing and shipping machine learning solutions with high iteration velocity.

Keras empowers engineers and researchers to take full advantage of the scalability and cross-platform capabilities of TensorFlow 2.0: you can run Keras on TPU or on large clusters of GPUs.

Source : Keras
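For a sense of what a Keras regression model looks like, here is a small sketch on synthetic data. The architecture (one hidden layer of 16 units), the loss, and the training settings are assumptions chosen for brevity and are not the network used for the transfer fee prediction.

```python
import numpy as np
from tensorflow import keras

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)).astype("float32")
y = (3.0 * X[:, 0]).astype("float32")

# A small fully connected network with a single linear output for regression.
model = keras.Sequential([
    keras.Input(shape=(5,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_absolute_error")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

predictions = model.predict(X, verbose=0)
```

In practice the input features would usually be scaled first, since neural networks are sensitive to feature magnitude in a way that tree-based models are not.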

Results